Typical DS pipeline


1. Data exploration
2. Feature engineering
3. ML model building
4. Deployment


Ways to organize work 


Cloud VM & SSH 
JupyterHub 
Kubeflow 


KF Pipelines/Metaflow/etc 


Managed ML platforms 


Resources granularity 


Cloud VM & SSH -> VM
JupyterHub -> Container
Kubeflow -> Container/API calls


KF Pipelines/Metaflow/etc -> Pipeline


Managed ML platforms -> VM/Container/Pipeline


VM/Container properties 


Manual lifecycle 
Inefficient utilization
Complex migration 


Utilization in numbers 


VMs: ~12%
Containers: ~35%


Inefficient = expensive 


Data exploration: 4 CPU, 32GB RAM
Feature engineering: 8 CPU, 64GB RAM
Model building: VERY EXPENSIVE GPU


VERY EXPENSIVE GPU per hour x work time (Sh) 
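The cost argument on this slide is plain arithmetic, so a minimal sketch may help. The price and the session length below are hypothetical placeholders, not figures from the talk; only the ~35% container utilization comes from the slides. The second function shows why low utilization is expensive: the price of one actually used hour is the list price divided by utilization.

```python
def session_cost(price_per_hour: float, hours: float) -> float:
    """Cost of holding a resource for a whole work session."""
    return price_per_hour * hours


def effective_hourly_cost(price_per_hour: float, utilization: float) -> float:
    """Price paid per hour of useful work at a given utilization (0..1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_hour / utilization


GPU_PRICE_PER_HOUR = 3.0  # hypothetical price, for illustration only
WORK_HOURS = 8            # hypothetical session length

print("session cost:", session_cost(GPU_PRICE_PER_HOUR, WORK_HOURS))
print("cost per useful hour at 35% utilization:",
      effective_hourly_cost(GPU_PRICE_PER_HOUR, 0.35))
```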


Scalability issues 


[Chart: GPU number vs. team size]


Restriction policies are painful 


Work well for small teams


50 data scientists can hardly
negotiate access peacefully


Oh, wait. Serverless? 


[Diagram: Client -> Balancer / Scheduler -> Service instances 1, 2, 3 (resources allocated on-demand) -> State storage]


Production service vs
data science workflow


Production service:
• Fixed env
• Standard artifacts (docker)
• State in database
• Should scale out


Data science workflow:
• Dynamic env (pip install)
• Arbitrary code on a local laptop
• Local state
• Should scale up


Serverless for DS is painful


SDK requires heavy 
code rewriting 


Startup time can be slow due to 
complex env and state 


Resources are usually 
allocated per pipeline 


Example 


Define a standalone Python function.
This function must meet the following requirements:


• It should not use any code declared outside
of the function definition


• Import statements must be added inside the function


• Helper functions must be defined inside this function


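# NamedTuple is needed by the signature, so it is imported at module level
from typing import NamedTuple


def my_divmod(dividend: float, divisor: float) -> NamedTuple(
        "MyDivmodOutput", [("quotient", float), ("remainder", float)]):
    # Import the numpy package inside the component function
    import numpy as np
    from collections import namedtuple

    # Define a helper function
    def divmod_helper(dividend, divisor):
        return np.divmod(dividend, divisor)

    quotient, remainder = divmod_helper(dividend, divisor)
    output = namedtuple("MyDivmodOutput", ["quotient", "remainder"])
    return output(quotient, remainder)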


Can serverless be painless? 


1. Existing code reuse
2. UX comparable with conservative tools
3. Fine-grained resources allocation
Way #1. Serverless Jupyter


Load data 


import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(url, names=names)


Visualize data

import matplotlib.pyplot as plt

dataset.hist()
plt.show()


[Figure: histograms of sepal-length, sepal-width, petal-length and petal-width]


Build model (runs on CPU+GPU)


from sklearn import model_selection
from sklearn.svm import SVC

array = dataset.values
X = array[:, 0:4]
Y = array[:, 4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

# random_state is omitted here: it has no effect unless shuffle=True
kfold = model_selection.KFold(n_splits=10)
cv_results = model_selection.cross_val_score(
    SVC(gamma='auto'), X_train, Y_train, cv=kfold, scoring='accuracy')
msg = "%s: %f (%f)" % ('Accuracy', cv_results.mean(), cv_results.std())
print(msg)


Accuracy: 0.991667 (0.025000)


Demo


[Notebook toolbar: instance type S (4 cores); "S instance is ready"]


from catboost import CatBoostClassifier
import numpy as np

train_data = np.load('train_data.npy')
train_labels = np.load('train_labels.npy')

model = CatBoostClassifier(iterations=1000,
                           task_type="CPU",
                           devices='0:1')

model.fit(train_data, train_labels, verbose=True)


What is the state? 


Production service: structured data in a database


Jupyter: FS + Python interpreter


State: file system 


Mount disks/NFS directly to workers


[Diagram: Jupyter -> Scheduler]


• Disks/NFS mounted to executors
• No concurrent access


State: python variables 


1. Complex structure
2. Can be BIG (limited by RAM)
3. No concurrent access


How to save interpreter state 


Options: serialization (pickle?) or CRIU snapshots (???), saved to blob storage/disk


Pitfalls:

• Serialization: to work correctly we need to serialize ALL
variables, but most of them are temporary

• CRIU snapshots whole processes and does not
support lazy loading
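To make the serialization option concrete, here is a minimal sketch of saving interpreter state with `pickle`. It is not how any particular product stores state; the helper names `snapshot_namespace` and `restore_namespace` are hypothetical. It walks a namespace dictionary, stores every variable it can pickle, and reports the ones it cannot, which shows both pitfalls at once: temporaries end up in the state, and some objects cannot be saved at all.

```python
import pickle
import types


def snapshot_namespace(namespace: dict) -> tuple[bytes, list[str]]:
    """Pickle every picklable variable; return the blob and the skipped names."""
    saved: dict[str, bytes] = {}
    skipped: list[str] = []
    for name, value in namespace.items():
        # Dunder entries and imported modules are not user state.
        if name.startswith("__") or isinstance(value, types.ModuleType):
            continue
        try:
            saved[name] = pickle.dumps(value)
        except Exception:
            # Lambdas, open files, sockets and similar objects land here.
            skipped.append(name)
    return pickle.dumps(saved), skipped


def restore_namespace(blob: bytes, namespace: dict) -> None:
    """Load every saved variable back into the given namespace."""
    for name, payload in pickle.loads(blob).items():
        namespace[name] = pickle.loads(payload)


# A stand-in for a notebook session; in a real kernel this would be globals().
session = {
    "clip": 1.0,
    "sample_history": [1, 2, 3],
    "handle": lambda x: x,  # not picklable
    "types": types,         # a module, ignored
}
blob, skipped = snapshot_namespace(session)

fresh: dict = {}
restore_namespace(blob, fresh)
print(sorted(fresh), skipped)
```

Storing each variable as its own payload is a deliberate choice in this sketch: it would allow loading variables one by one later, which a whole-process snapshot does not.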


State growth problem 


All these variables will be in the state!


[Chart: state size grows with session time]


[Notebook excerpt, truncated on the slide]

    return full_val_loss / (overall_sequence_length * 88)

clip = 1.0
epochs_number = 12000000
sample_history = []
best_val_loss = float("inf")

for epoch_number in xrange(epochs_number):
    for batch in trainset_loader:
        post_processed_batch_tuple = post_process_sequence_batch(batch)
        input_sequences_batch, output_sequences_batch, sequences_lengths = post_processed_batch_tuple
        output_sequences_batch_var = Variable( output_sequences_batch.contiguous().view(-1).cuda


Serverless Jupyter: overview


• Compatible with vanilla Jupyter


• Works pretty well for simple state
(popular libs)


• Can be painful for users
with complex env


DataSphere: our serverless
Jupyter implementation


https://cloudil.co.il 
Way #2. Cluster injections


@op                                  # runs on CPU
def dataset() -> Bunch:
    raw_data = load_breast_cancer()
    data_set = clean_data(raw_data)
    return data_set


@op(gpu=Gpu.any())                   # runs on CPU+GPU
def train(data_set: Bunch) -> Classifier:
    model = CatBoostClassifier(
        iterations=1000, task_type="GPU"
    )
    model.fit(data_set.data, data_set.target)
    return model


data = dataset()
model = train(data)


[IDE screenshot: catboost_whiteboard.py]


from catboost import CatBoostClassifier
from lzy.api.v1 import Gpu, LzyRemoteEnv, op
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.utils import Bunch


def dataset() -> Bunch:
    data_set = datasets.load_breast_cancer()
    return data_set


def search_best_model(data_set: Bunch) -> GridSearchCV:
    grid = {"max_depth": [3, 4], "n_estimators": [100, 200]}
    cb_model = CatBoostClassifier(train_dir="/tmp/catboost")
    search = GridSearchCV(estimator=cb_model, param_grid=grid, scoring="accuracy", cv=3)
    search.fit(data_set.data, data_set.target)
    return search


data = dataset()
model = search_best_model(data)


Cluster injection principles


No IDE binding


Minimal changes in existing code
(not fair for notebooks)


Automatic environment migration 


Hybrid execution 


Function as a 
computational unit 


An operation is an ordinary Python function with type annotations


Decorators (meta-information) are used for hardware requirements


@op(gpu=Gpu.any())
def train(data_set: Bunch) -> CatBoostClassifier:
    cb_model = CatBoostClassifier(iterations=1000)
    cb_model.fit(data_set.data, data_set.target)
    return cb_model


State: function values


• Only arguments and return values must be serializable
• Temporary variables live only during execution


No need to serialize all these variables!


def solve_lorenz(sigma=10.0, beta=8./3, rho=28.0):
    """Plot a solution to the Lorenz differential equations."""
    max_time = 4.0
    N = 30

    fig = plt.figure()
    ax = fig.add_axes([0, 0, 1, 1], projection='3d')
    ax.axis('off')

    # prepare the axes limits
    ax.set_xlim((-25, 25))
    ax.set_ylim((-35, 35))
    ax.set_zlim((5, 55))

    def lorenz_deriv(x_y_z, t0, sigma=sigma, beta=beta, rho=rho):
        """Compute the time-derivative of a Lorenz system."""
        x, y, z = x_y_z
        return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

    # Choose random starting points, uniformly distributed from -15 to 15
    np.random.seed(1)
    x0 = -15 + 30 * np.random.random((N, 3))

    # Solve for the trajectories
    t = np.linspace(0, max_time, int(250 * max_time))
    x_t = np.asarray([integrate.odeint(lorenz_deriv, x0i, t)
                      for x0i in x0])

    # choose a different color for each trajectory
    colors = plt.cm.viridis(np.linspace(0, 1, N))

    for i in range(N):
        x, y, z = x_t[i, :, :].T
        lines = ax.plot(x, y, z, '-', c=colors[i])
        plt.setp(lines, linewidth=2)

    angle = 104
    ax.view_init(30, angle)
    plt.show()

    return t, x_t
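The idea that only arguments and return values cross the boundary can be sketched with a small decorator. This is an illustration under stated assumptions, not the library's implementation: `through_pickle` is a hypothetical name, and the "remote" call is simulated in the same process by a round trip through `pickle`.

```python
import pickle
from functools import wraps


def through_pickle(func):
    """Simulate a remote call: only inputs and the output are serialized."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # What would be sent to the worker.
        payload = pickle.dumps((args, kwargs))
        remote_args, remote_kwargs = pickle.loads(payload)
        # Temporaries created inside func live only during this call.
        result = func(*remote_args, **remote_kwargs)
        # What would be sent back to the caller.
        return pickle.loads(pickle.dumps(result))
    return wrapper


@through_pickle
def mean_of_squares(values: list[float]) -> float:
    squares = [v * v for v in values]    # temporary, never serialized
    average = lambda s: sum(s) / len(s)  # not picklable, and it does not matter
    return average(squares)


result = mean_of_squares([1.0, 2.0, 3.0])
print(result)
```

The lambda inside the function could not be saved by a whole-interpreter snapshot, yet the call works, because it never leaves the function body.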


Automatic env migration 


import mylib -> Installed from PyPI?

• yes: add pip install command to conda.yaml
• no: detect module dirs and transfer them as an archive


Hybrid execution 


[Diagram: Scheduler dispatching to Executor 1, Executor 2, Executor 3; legend: data plane, control plane]


Pitfalls 


Complex environments cannot be captured automatically
-> For complex environments, it is possible to override the docker image


Need for code migration from notebooks
-> Small changes if code is written in procedural style


Local data uploading to storage can be slow
-> We can cache data during a working session


Cluster injections 
overview 


• No IDE binding
• No environment/OS binding


• Existing code friendly for Jupyter haters


• Can be painful for Jupyter lovers


ʎzy: our open-source cloud
injections lib over k8s


clck.ru/32sZ8x 
Optimizations


Huge conda/docker environment -> Cache on SSD disks (we can run docker with layers on another device)


VMs creation can be slow -> Common pools of "hot" VMs


Some operations can be run in parallel -> Actually all @op-function calls are lazy :)
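Why deferred calls enable parallelism can be shown with a futures-based stand-in. This is a sketch, not the library's mechanism: `deferred_op` is a hypothetical decorator, and a local thread pool plays the role of the cluster. Each call returns a placeholder immediately, so two independent operations overlap, and a dependent operation waits only for its own inputs.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

# A local thread pool stands in for remote executors.
_pool = ThreadPoolExecutor(max_workers=4)


def deferred_op(func):
    """Return a Future immediately instead of blocking on the result."""
    @wraps(func)
    def wrapper(*args):
        def run():
            # Resolve arguments that are still pending results.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(run)
    return wrapper


@deferred_op
def load(n: int) -> list[int]:
    time.sleep(0.2)  # pretend this is slow I/O
    return list(range(n))


@deferred_op
def total(xs: list[int], ys: list[int]) -> int:
    return sum(xs) + sum(ys)


a = load(3)                      # returns at once
b = load(4)                      # runs concurrently with the first load
result = total(a, b).result()    # blocks only here
print(result)
_pool.shutdown(wait=True)
```

Tasks are submitted in dependency order, so a waiting task always depends on work that was queued earlier; that keeps this small pool from deadlocking in the sketch.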


Performance 


Utilization ~95-99%


Median operation startup time
~10 sec


[Chart: time axis from Nov 25 to Dec 06]


The very last slide... almost 


Serverless DS


Business: costs optimization
Data scientists: friendly UX


Leave your feedback! 